Fake News Detector

Data Exploration

Using Amazon's SageMaker for Train | Deployment


This project is about a classification model that examines a text file from news and performs binary classification; labeling that news as either fake or real. The model was trained using a dataset from kaggle. The dataset consists of about 21000 news labeled as true and about 23000 news categorized as fake.

The project is inspired from a research that suggests both liberals and conservatives are motivated to believe fake news, and dismiss real news that contradicts their ideologies. Fake news spread through social media has become a serious problem and this study aims to build an unbiased model that could detect news as fake or real.

The first step in working with any dataset is loading the data in and noting what information is included in the dataset. This is an important step in eventually working with the data, and knowing what kinds of features we have to work with as we transform and group the data. So, this notebook is all about exploring the data and noting patterns about the features we are given and the distribution of data.

General Outline

  1. Import Libraries
  2. Read in the Data
  3. Prepare and Process the Data
  4. Check Data
  5. Explore the Data
    1. Distribution of the Classes
    2. Distribution of the News Topics
    3. Distribution of the News Topics and Classes
    4. Explore Date
    5. Explore Text
    6. Explore Words
      1. Word Clouds in Fake and True News
      2. Most Frequent Words in Fake and True News
  6. Next Notebook

Import Libraries

In the next cells we will import the required libraries for this analysis and set some global variables.

In [1]:
# Install libraries.
# !pip install chart_studio
# !pip install wordcloud

# Import libraries.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from PIL import Image
import chart_studio.plotly as py
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

import re
from bs4 import BeautifulSoup
from tqdm import tqdm
from collections import Counter

# Set Plotly theme.
pio.templates.default = "gridon"

# Set global variables.
RANDOM_STATE = 5
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/perseas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/perseas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Go to the top

Read in the Data

The cell below will load the data into pandas dataframes.

Acknowledgements for data:

  • Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.
  • Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

The dataset is made of multiple text strings and other characteristics summarized in csv files named Fake.csv and True.csv, which we can read in using pandas.

In [2]:
true = pd.read_csv("data/True.csv")
fake = pd.read_csv("data/Fake.csv")

# Show first rows for each dataset.
display(true.head())
display(fake.head())

# Print the number of real and fake news.
print('\nThere are {} real and {} fake news'.format(true.shape[0], fake.shape[0]))
title text subject date
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017
title text subject date
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31, 2017
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31, 2017
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30, 2017
3 Trump Is So Obsessed He Even Has Obama’s Name... On Christmas day, Donald Trump announced that ... News December 29, 2017
4 Pope Francis Just Called Out Donald Trump Dur... Pope Francis used his annual Christmas Day mes... News December 25, 2017
There are 21417 real and 23481 fake news

Go to the top

Prepare and Process the Data

It is more convenient to merge the two datasets and then apply any processing tasks and text transformations to the new dataframe. First, we need to create a new column label to save the labels of each text. We will also shuffle the final dataframe and then delete the previous datasets to free up some memory.

In [3]:
# Create the 'label' column.
true['label'] = 'True'
fake['label'] = 'Fake'

# Concatenate the 2 dfs.
df = pd.concat([true, fake])

# To save a bit of memory we can set fake and true to None.
fake = true = None

#  Shuffle data.
df = df.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

# Show first rows.
df.head()
Out[3]:
title text subject date label
0 Supreme Court sympathetic to property owner in... WASHINGTON (Reuters) - The U.S. Supreme Court ... politicsNews March 30, 2016 True
1 LAWS ARE FOR THE COMMON MAN…NOT FOR BARRY SOET... It was all an accident. It s good to King Th... left-news May 12, 2015 Fake
2 Turkey 'appalled' by U.S. stance on IS withdra... (This November 15 story was corrected to clar... worldnews November 14, 2017 True
3 Germany extends passport controls on Austrian ... BERLIN (Reuters) - Germany has extended tempor... worldnews October 12, 2017 True
4 Ted Cruz: I’m ‘Honored’ By Support Of This De... Ted Cruz is a dangerous extremist who regularl... News April 10, 2016 Fake

Go to the top

Check Data

The dataframe will be examined for the quality of its data. The types and shape of the data will be checked, as well as if there are any missing records.

In [4]:
# Check df.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  object
dtypes: object(5)
memory usage: 1.7+ MB

Inference

  • There are no missing values.
  • Dates were recognized as strings. Later, we will use a pandas date parser function to recognize dates correctly.

Go to the top

Explore the Data

Next, let's look at the distribution of data.

Distribution of the Classes

We need to check how evenly is our data distributed among the two classes.

In [5]:
# Show counts for each class.
fig = px.bar(df.groupby('label').count().reset_index(), x='label', y='title', text='title', opacity=0.6)
fig.update_layout(title_text='Distribution of News')
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.update_yaxes(showticklabels=False)
fig.show()

Inference

  • It seems that the dataset is balanced, so no actions needed to handle any imbalances between the classes.

Go to the top

Distribution of the News Topics

It may also be helpful to look at the subject distribution.

In [6]:
# Show counts for each class.
fig = px.bar(df.groupby('subject').count()['title'].reset_index().sort_values(by='title'),
             x='subject', y='title', text='title', opacity=0.6)
fig.update_layout(title_text='Distribution of News Subjects')
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.update_yaxes(showticklabels=False)
fig.show()

Inference

  • We have 5 categories with a lot of True and Fake news and 3 with only a few hundres.

Go to the top

Distribution of the News Topics and Classes

Let's dig deeper and see the distribution of labels inside each subject.

In [7]:
df_sum = df.groupby(['label', 'subject']).count().reset_index()
fig = px.bar(df_sum, x='label', y='title', color='subject', text='title', opacity=0.6)
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.update_yaxes(showticklabels=False)
fig.show()

Inference

  • It seems that the Fake and Real news datasets do not contain the same topics in the subject category. Possibly, politics is similar to politicsNews, but since subjects are mapped differently between True and Fake news, it would be better to remove this feature from our model.

Go to the top

Explore Date

It would be interesting to see if there are any patterns in date.

In [8]:
# Convert date str into date object. Take care of any errors for invalid dates.
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_date = df.groupby(['label', 'date'])['title'].count().reset_index()

fig = px.line(df_date, x='date', y='title', color='label')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

Inference

  • Not too much to infer from this time series plot.
  • Let's extract the week of the year, month of the year and other date features to check if there is any seasonality. It would be wise to compare the date features at the same window. It seems that the more True news observed after August 2017 is not because of the date. For the same reason, the lack of True news before January 2016 is not due to the date. In other words, we cannot infer that all the news before January 2016 were Fake.
In [9]:
# Filter df based on date.
df_filtered = df[(df['date'] < '2017-08-31') & (df['date'] > '2016-02-01')].copy()
df_filtered.loc[:, 'weekday'] = df_filtered['date'].dt.dayofweek
df_filtered.loc[:, 'week'] = df_filtered['date'].dt.weekofyear
df_filtered.loc[:, 'month'] = df_filtered['date'].dt.month
df_filtered.loc[:, 'quarter'] = df_filtered['date'].dt.quarter

df_weekday = df_filtered.groupby(['label', 'weekday']).count()['title'].reset_index()

fig = px.line(df_weekday, x='weekday', y='title', color='label')
fig.update_layout(title_text='Day of Week')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()
In [10]:
df_week = df_filtered.groupby(['label', 'week']).count()['title'].reset_index()

fig = px.line(df_week, x='week', y='title', color='label')
fig.update_layout(title_text='Week of the Year')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()
In [11]:
df_month = df_filtered.groupby(['label', 'month']).count()['title'].reset_index()

fig = px.line(df_month, x='month', y='title', color='label')
fig.update_layout(title_text='Monthly')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()
In [12]:
df_quarter = df_filtered.groupby(['label', 'quarter']).count()['title'].reset_index()

fig = px.line(df_quarter, x='quarter', y='title', color='label')
fig.update_layout(title_text='Quarterly')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

Inference

  • There is no clear distinction between date features in Fake and True news.

Go to the top

Explore Text

Let's print a couple of True and Fake examples to check if there are any differences.

In [13]:
print('Fake News\n')
print(df[df.label == 'Fake']['text'].tolist()[3])
print()
print(df[df.label == 'Fake']['text'].tolist()[5])
print()
print('\n\nTrue News\n')
print(df[df.label == 'True']['text'].tolist()[0])
print()
print(df[df.label == 'True']['text'].tolist()[2])
Fake News

If it were up to this  President,  we wouldn t have any arms to bear U.S. Citizenship and Immigration Services on Tuesday said it will no longer require incoming U.S. citizens to pledge that they will  bear arms on behalf of the United States  or  perform noncombatant service  in the Armed Forces as part of the naturalization process.Those lines are in the Oath of Allegiance that people recite as they become U.S. citizens. But USCIS said people  may  be able to exclude those phrases for reasons related to religion or if they have a conscientious objection.USCIS said people with certain religious training or with a  deeply held moral or ethical code  may not have to say the phrases as they are naturalized.The agency said people don t have to belong to a specific church or religion to use this exemption, and may attest to U.S. officials administering the oath that they have these beliefs.Via: Washington Examiner

21st Century Wire says This will rank as one of the most egregious miscarriages of consumer trust in US history. In a brief statement to Reuters, according to these Silicon Valley executives: Yahoo is a law-abiding company, and complies with the laws of the United States.  Actually, Yahoo is in breach of the law. The  laws  they think they are talking about do not allow for flipping a  dead switch  on the 4th Amendment and conducting mass surveillance on all US citizens.In reality, the US Constitution is meant to be the supreme law of the land and Yahoo just colluded with an out of control US federal government   to trash the 4th Amendment.So who is Marissa Mayer, the Yahoo CEO who willingly colluded behind closed doors with the NSA and the FBI to submit all of Yahoo s customer base to unwarranted surveillance? Why was this 37 year old neophyte allowed to walk into the top spot from Google with no real managerial experience, and put in charge of such a large corporation   and then given a $54 million golden parachute after this dark spying debacle? She has quickly amassed net worth of approximately $500 million   half a billion dollars.It sounds like she was definitely  hand-picked  for this particular job  SELL OUT: Yahoo CEO Marissa Mayer sold out her customer base to an out of control, data-hungry Federal government.The GuardianYahoo last year secretly built a custom software program to search all of its customers  incoming emails for specific information provided by US intelligence officials, sources have told Reuters.The company complied with a classified US government directive, scanning hundreds of millions of Yahoo Mail accounts at the behest of the National Security Agency (NSA) or FBI, according to two former employees and a third person who knew about the program.Some surveillance experts said this represents the first known case of a US internet company agreeing to a spy agency s demand by searching all arriving messages, as opposed to examining stored messages or scanning a small number of accounts in real time.It is not known what information intelligence officials were looking for, only that they wanted Yahoo to search for a set of characters. That could mean a phrase in an email or an attachment, said the sources.Reuters was unable to determine what data Yahoo may have handed over, if any, and whether intelligence officials had approached other email providers besides Yahoo with this kind of request.According to the two former employees, Yahoo CEO Marissa Mayer s decision to obey the directive troubled some senior executives and led to the June 2015 departure of the chief information security officer, Alex Stamos, who now heads security at Facebook.( ) Andrew Crocker, staff attorney with the Electronic Frontier Foundation, said that the use of the word  directive  to describe the program indicated that the request may have been ordered under the section 702 of the 2008 Fisa Amendments Act, which allows the government to target non-US citizens abroad for surveillance.Revelations by Edward Snowden about the Prism and Upstream programs   of which the Yahoo program looks like a hybrid, Crocker said   show that US citizens were also subject to mass surveillance. The fourth amendment and attendant privacy concerns are quite staggering,  Crocker said.  It sounds like they are scanning all emails, even inside the US   the fourth amendment protects that fully. It s hard to see how the government justifies requiring Yahoo to search emails like that; there is no warrant that could possibly justify scanning all emails Continue this story at The GuardianREAD MORE SCI-TECH NEWS AT: 21st Century Wire Sci-Tech Files 



True News

WASHINGTON (Reuters) - The U.S. Supreme Court on Wednesday appeared likely to rule that property owners can challenge the federal government in court over the need for permits under a national water protection law in a case involving a company’s plans for a Minnesota peat mine. The court heard a one-hour argument in a case balancing property rights and environmental law, in this instance the landmark 1972 U.S. Clean Water Act. A majority of the eight justices appeared sympathetic toward North Dakota-based Hawkes Co Inc, which is fighting an Obama administration finding that its property includes wetlands. The law mandates that property owners get permits in such situations. Whether a particular plot of land falls under the law’s jurisdiction is important to developers and other property owners because such a finding triggers a lengthy and expensive permitting process. Hawkes’ lawyers argued the company should be able to contest whether it even needs to go through the permit process. Liberal and conservative justices alike expressed concern about the current arrangement’s burden on property owners. Conservative Chief Justice John Roberts said applicants who disregard a government finding that they need a permit do so at “great practical risk.” Liberal Ruth Bader Ginsburg called the process “very arduous and very expensive.” Liberal Stephen Breyer called the government decision that Hawkes needed a permit “perfectly suited for review in the courts.” Only liberal Elena Kagan expressed support for the government, raising concerns about the impact a ruling favoring property owners would have on actions by other government agencies such as the Securities and Exchange Commission. Property rights advocates said the permitting process can take two years and cost up to $270,000, with owners facing penalties of up to $37,500 a day for noncompliance. Business groups including the National Association of Home Builders and the U.S. Chamber of Commerce and 29 states filed court papers opposing the Obama administration in the case. The case follows the justices’ unanimous 2012 ruling that property owners facing enforcement action under the Clean Water Act can ask a court to intervene before being forced to comply or pay financial penalties. The Obama administration last year issued a new regulation defining the scope of federal jurisdiction over bodies of water. A federal appeals court put the rule on hold after it was challenged by 18 states. Only eight justices participated in the case following Justice Antonin Scalia’s February death. A ruling is due by the end of June. 

BERLIN (Reuters) - Germany has extended temporary passport controls on its border with Austria and for flights coming from Greece for six more months amid continuing concerns about attacks and illegal immigration, the Interior Ministry said on Thursday. The European Commission last month offered initial backing to a Franco-German proposal to allow more permanent border checks within the bloc s Schengen free-travel zone. Germany notified the European Commission, the Council of the European Union, the President of the European Parliament and the interior ministers of the EU-Schengen states about its decision in a letter on Wednesday. The commission allowed Germany to control its border with Austria in 2015 after a surge of refugees and migrants began arriving in western Europe. The temporary permission was due to expire on Nov. 11. Immigration will be a key issue in coalition talks by Chancellor Angela Merkel s conservatives, the pro-business Free Democrats and the environmentalist Greens.  Interior Minister Thomas de Maiziere justified Germany s move given continued security concerns, and said a complete return to free travel within the Schengen area was contingent on an improvement in the overall situation. Germany was hit by five Islamist attacks in 2016, including one in December on a Berlin Christmas market that killed 12 people.  We are working hard on this, all member states, the EU Commission and the EU Council, but there is still a long way to go,  de Maiziere said in a statement. Border controls also apply to flights to Germany from Greece, which was the gateway for thousands of migrants from Iraq and Syria. Ministry data on Wednesday showed that 140,000 asylum seekers have registered in Germany since the beginning of 2017. 

Inference

  • Maybe True news contain more dates, numbers, and names than Fake ones. Later, we will extract these features and use them as input in the models.

Go to the top

Explore Words

It is interesting to see what kind of words and phrases are commonly used in Fake and True news. Before this exploration, a good strategy is to clean the text applying the following steps:

  • Read a text file as a string of raw text.
  • Lower case all words, so that captialization is ignored (e.g., IndIcaTE is treated the same as Indicate).
  • Normalize numbers, replacing them with the text number.
  • Remove non-words, remove punctuation, and trim all white spaces (tabs, newlines, spaces) to a single space character.
  • Tokenize the raw text string into a list of words where each entry is a word. Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens become the input for another process like parsing and text mining. Then the words will be ready to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or vectorization).
  • Use lemmatization or stemming to consolidate closely redundant words. For example, "discount", "discounts", "discounted" and "discounting" will be all replaced with "discount". Sometimes, the Stemmer actually strips off additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
  • Remove stopwords. Stop words are so frequently used that for many tasks (but not all) they don't carry much information. Examples are "any", "all", "what", etc. NLTK has an inbuilt corpus of english stopwords that can be loaded and used.
  • Apply additional text preparation steps, such as normalizing links and emails: All https and http links will be replaced with the text "link" and all emails will be replaced with the text "email".
  • Render either a word cloud or a bar chart with the most frequent unigrams, bigrams, trigrams, etc.

Word Clouds in Fake and True News

First, we can use directly the WordCloud module before doing any heavy text processing, so as to get an idea of the most important words or phrases in fake and true news.

In [14]:
# Create a function to create a word cloud.
def make_wordcloud(text, mask, color):
    wordcloud = WordCloud(max_words=200, mask=mask,
                          background_color='white',
                          contour_width=2,
                          contour_color=color).generate(text)
    plt.figure(figsize=(17,12))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

# Read an image in order to use it as a shape for our word cloud.
fake_mask = np.array(Image.open("data/fake.png"))
true_mask = np.array(Image.open("data/true.png"))

# Get the fake and true news.
fake_text = " ".join(text for text in df[df.label == 'Fake']['text'])
true_text = " ".join(text for text in df[df.label == 'True']['text'])

# Render word clouds.
make_wordcloud(fake_text, fake_mask, 'blue')
make_wordcloud(true_text, true_mask, 'orange')

Inference

  • Fake news contain a lot of words like Donald Trump, Hillary Clinton, White House, and United States.
  • True news contains a lot of the words found in fake, but also contains a lot of dates like on Tuesday, on Monday, on Sunday, last week, etc.

Go to the top

Most Frequent Words in Fake and True News

Let's apply the text processing steps described before and then print the most frequent words.

In [15]:
# Create a new 'tqdm' instance to time and estimate the progress of functions.
tqdm.pandas()

# Create a function to clean and prepare text.
def clean_text(text):
    """ Remove any punctuation, numbers, newlines, and stopwords.
    Convert to lower case.
    Split the text string into individual words, stem each word,
    and append the stemmed word to words. Make sure there's a single
    space between each stemmed word.
    Args:
        text: text, string
    Returns:
        words: cleaned words, list
    """
    
    # Replace numbers with the str 'number'.
    text = re.sub('\d+', 'number', text)
    
    # Replace newlines with spaces.
    text = re.sub('\n', ' ', text)
    
    # Replace punctuation with spaces.
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Remove HTML tags.
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Replace links with the str 'link'
    text = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                   'link', text, flags=re.MULTILINE)

    # Replace emails with the str 'email'
    text = re.sub('\S+@\S+', 'email', text, flags=re.MULTILINE)
    
    # Convert all letters to lower case.
    text = text.lower()
    
    # Create the stemmer.
    stemmer = SnowballStemmer('english')
    
    # Split text into words.
    words = text.split()
    
    # Remove stopwords.
    words = [w for w in words if w not in stopwords.words('english')]
    
    # Stem words.
    words = [stemmer.stem(w) for w in words]
    
    return words

# Apply the cleaning function to the dataset.
df.text = df.text.progress_apply(clean_text)
/Users/perseas/opt/anaconda3/lib/python3.7/site-packages/tqdm/std.py:668: FutureWarning:

The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version

100%|██████████| 44898/44898 [38:16<00:00, 19.55it/s]  

Create a function to count and return the most frequent words and then plot them in a horizontal bar chart.

In [16]:
# Create a function to count and return the most frequent words.
def frequent_words(label, max_words):
    # Gather text and concatenate.
    text = df[df['label'] == label]['text'].values
    text = np.concatenate(text)
    
    # Count words.
    counts = Counter(text)
    
    # Create a pandas df from the Counter dictionary.
    df_counts = pd.DataFrame.from_dict(counts, orient='index')
    df_counts = df_counts.rename(columns={0:'counts'})
    
    # Return a df with the most frequent words.
    return df_counts.sort_values(by='counts', ascending=False).head(max_words).sort_values(by='counts')

# Get the 50 most frequent words.
df_fake_counts = frequent_words(label='Fake', max_words=50)
df_true_counts = frequent_words(label='True', max_words=50)

# Plot horizontal bar charts.
fig = make_subplots(rows=1, cols=2, subplot_titles=("Fake News", "True News"))

fig.add_trace(go.Bar(x=df_fake_counts.counts.tolist(),
                     y=df_fake_counts.index.values.tolist(),
                     orientation='h', opacity=0.6), 1, 1)

fig.add_trace(go.Bar(x=df_true_counts.counts.tolist(),
                     y=df_true_counts.index.values.tolist(),
                     orientation='h', opacity=0.6), 1, 2)

fig.update_layout(height=900, width=900, title_text="Most Frequent Words", showlegend=False)
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.show()

Inference

  • We got an idea of the most frequent unigrams and bigrams. We could further continue the analysis of trigrams or even higher n-grams, but the purpose of this study is to build a classification model.

Go to the top

Next Notebook

In the next notebook, we will use these datasets to train a complete fake news classifier. We will extract meaningful features from the text, which we will use to train and deploy a classification model in an AWS SageMaker notebook instance.